Port switch service

ABSTRACT

Provided is a port switch service (PSS) including a server cluster and a client cluster, wherein a master node in the current cluster is elected from the server cluster through a quorum algorithm and is guaranteed to be unique within a specified period in the form of a lease; the client cluster contains the client nodes that need to use the PSS, and each client node can establish a connection with the master node as needed; and each client node is identified in the server cluster by a unique node ID. The port switch service is a message routing service that integrates distributed coordination functions such as fault detection, service election, service discovery, and distributed locking. By sacrificing reliability under extreme conditions, the port switch service achieves very high performance, capacity and concurrency while ensuring strong consistency, high availability and scalability.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a distributed coordination system, in particular to a port switch service.

2. The Prior Arts

Traditional distributed coordination services are usually implemented using quorum-based consensus algorithms such as Paxos and Raft. Their main purpose is to provide applications with a highly available service for accessing distributed metadata key-value (KV) storage. Distributed coordination services such as distributed locking, message dispatching, configuration sharing, role election and fault detection are also offered on top of the consistent KV storage. Common implementations of distributed coordination services include Google Chubby (Paxos), Apache ZooKeeper (Fast Paxos), etcd (Raft), Consul (Raft+Gossip), etc.

Poor performance and high network consumption are the major problems with consensus algorithms such as Paxos and Raft. Each access to these services, whether a write or a read, requires three rounds of broadcasting within the cluster to confirm by voting that the current access is acknowledged by the quorum. This is because the master node needs to confirm that it has the support of the majority while the operation is happening, and that it remains the legal master node.

In real cases, although read performance can be optimized by degrading the overall consistency of the system or by adding a lease mechanism, the overall performance is still very low and has a strong impact on network IO. Looking back at the major accidents that have happened at Google, Facebook or Twitter, many of them were caused by network partition or wrong configuration (human error). Such errors lead algorithms like Paxos and Raft to broadcast messages in an uncontrollable way, thus driving the whole system to crash.

Furthermore, because the Paxos and Raft algorithms place high requirements on network IO (both throughput and latency), it is difficult (and expensive) to deploy a distributed cluster across multiple data centers with strong consistency (anti split brain) and high availability. Examples include: on Aug. 20, 2015, the Google GCE service was interrupted for 12 hours and permanently lost part of its data; on May 27, 2015 and Jul. 22, 2016, Alipay was interrupted for several hours; and on Jul. 22, 2013, the WeChat service was interrupted for several hours. These major accidents occurred because the products did not implement the multiple active IDC architecture correctly, so a single IDC failure took the full service offline.

SUMMARY OF THE INVENTION

The present invention aims to solve these problems by providing a port switch service (PSS) that offers distributed coordination functions such as fault detection, service election, service discovery, and distributed locking, as well as strong consistency, high availability and anti split brain capabilities at the same level as the Paxos and Raft algorithms. Performance and parallel processing capability tens of thousands of times those of the former are provided because high-consumption operations, such as nearly all network broadcasting and disk I/O, are eliminated. A large-scale distributed cluster system across multiple IDCs can be built without additional requirements on network throughput, delay, etc.

In order to realize these purposes, the technical scheme of the present invention is as follows: A port switch service (PSS) includes a server cluster and a client cluster, wherein a master node in the current cluster is elected from the server cluster through a quorum algorithm and is guaranteed to be unique within a specified period in the form of a lease; the client cluster contains the client nodes that need to use the PSS, and each client node can establish a connection with the master node as needed; and each client node is identified in the server cluster by a unique node ID.

Further, the server cluster employs a mode of one master node plus a plurality of slave nodes, or a mode of one master node plus a plurality of slave nodes plus a plurality of arbiter nodes.

Further, each client node (a server within an application cluster) maintains at least one TCP Keep-Alive connection with the port switch service.

Further, any number of ports can be registered for each connection. A port is described using a UTF-8 character string and must be globally unique.

Further, PSS offers the following application programming interface (API) primitives: Waiting for Message (WaitMsg), Relet, Port Registration (RegPort), Port Un-registration (UnRegPort), Message Sending (SendMsg), Port Query (QueryPort), Node Query (QueryNode) and Clear.

Further, the connections between the client cluster and the port switch service include message receiving connections and message sending connections.

With adoption of this technology, compared with the prior art, the present invention has the following positive effects:

The present invention eliminates major overheads, such as network broadcasting and disk I/O, that follow each access request in traditional distributed coordination algorithms such as Paxos and Raft, and thus the overall performance of the system is remarkably improved (by thousands or even tens of thousands of times, see sections [0006] and [0060]). In addition, the present invention supports a batch request mechanism, since a vote no longer needs to be initiated for each individual request; this greatly increases the network utilization ratio (by several tens of times) and further strengthens system performance under heavy load (during busy periods, see sections [0006], [0032]-[0037], and [0060]).

The present invention integrates a standard message routing function into distributed coordination services such as service election (port registration), service discovery (message sending and port information query), fault detection (relet timeout) and distributed locking (port registration and unregistration notification). The result is a high-performance message switch service with distributed coordination capabilities. It can also be used purely as a service election and discovery service with fault detection.

The design of the present invention eliminates unrelated functions such as a configuration management database (CMDB), which further strengthens the capacity and performance of the system (equivalent to retaining only the key (K) and removing the value (V) part of a traditional KV storage mechanism, or retaining only the path information and removing the values in a traditional tree data structure).

The present invention maintains a message buffering queue for each connection and saves all port definitions and messages to be forwarded in the master node's memory (full in-memory); no data replication or state synchronization is needed between the master node and the slave nodes; and message sending and receiving are both realized using pure asynchronous I/O, so that high-concurrency and high-throughput message forwarding performance can be provided.

The present invention is scalable: when single-node performance becomes a bottleneck, the service can scale out by cascading upper-level port switch services, similar to the three-layer (access, aggregation, and core) switch architecture in an IDC.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic structural diagram of the port switch service of the present invention with one master node plus a plurality of slave nodes.

FIG. 2 is a schematic structural diagram of the port switch service of the present invention with one master node plus a plurality of slave nodes plus a plurality of arbiter nodes.

FIG. 3 is a schematic structural diagram of a horizontally scaled PSS server cluster and client cluster in a tree structure.

FIG. 4 is a usage example of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Embodiments of the present invention are further described below in conjunction with the drawings.

In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention will be described in detail in conjunction with functional diagrams and flow diagrams. The following schematic embodiments and descriptions thereof are provided to illustrate the present invention and do not constitute any limitation of the present invention.

A port switch service (PSS) includes a server cluster and a client cluster, wherein a master node in the current cluster is elected from the server cluster through a quorum algorithm and is guaranteed to be unique within a specified period in the form of a lease; the client cluster contains the client nodes that need to use the PSS, and each client node can establish a connection with the master node as needed; and each client node is identified in the server cluster by a unique node ID.

Referring to FIGS. 1 and 2, preferably, the server cluster employs a mode of one master node plus a plurality of slave nodes, or a mode of one master node plus a plurality of slave nodes plus a plurality of arbiter nodes.

Preferably, each client node (a server within an application cluster) maintains at least one TCP Keep-Alive connection with the port switch service.

Preferably, any number of ports can be registered for each connection. A port is described using a UTF-8 character string and must be globally unique. Registering a port will fail if the port is already registered by another client node.

Preferably, PSS offers the following application programming interface (API) primitives: Waiting for Message (WaitMsg), Relet, Port Registration (RegPort), Port Un-registration (UnRegPort), Message Sending (SendMsg), Port Query (QueryPort), Node Query (QueryNode) and Clear.

PSS offers the following API primitives:

Waiting for Message (WaitMsg): Each node within the cluster should keep at least one TCP Keep-Alive connection with the PSS and call this method to wait for messages pushed by the server. This method upgrades the current connection from a message sending connection to a message receiving connection.

Each node number corresponds to only one message receiving connection. If a node attempts to establish two message receiving connections at the same time, the earlier connection will be disconnected, and all ports bound to that node will be unregistered.

Relet: If PSS does not receive a relet request from a message receiving connection within a specified time period, it will treat the node as offline and will release all the ports associated with this node. A relet operation is used to periodically provide heartbeat signals to PSS.
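
The relet timeout described above can be illustrated with a minimal in-memory sketch. It is not the PSS implementation: the class name `LeaseTable`, its methods, and the injectable `clock` parameter are assumptions introduced only for illustration.

```python
import time


class LeaseTable:
    """Hypothetical sketch: tracks the last relet per node and the ports it owns."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed node timeout, in seconds
        self.clock = clock           # injectable clock, convenient for testing
        self.last_relet = {}         # node id -> time of the last relet
        self.ports_by_node = {}      # node id -> set of ports owned by the node

    def relet(self, node_id):
        """Heartbeat: refresh the lease of a node."""
        self.last_relet[node_id] = self.clock()
        self.ports_by_node.setdefault(node_id, set())

    def add_port(self, node_id, port):
        """Record that a node owns a port."""
        self.ports_by_node.setdefault(node_id, set()).add(port)

    def expire(self):
        """Treat nodes with a lapsed lease as offline; return the ports released."""
        now = self.clock()
        released = {}
        for node_id, seen in list(self.last_relet.items()):
            if now - seen > self.timeout_s:
                released[node_id] = self.ports_by_node.pop(node_id, set())
                del self.last_relet[node_id]
        return released
```

In this sketch a node that has not called `relet` within `timeout_s` loses all of its ports on the next `expire` pass, while nodes that renewed in time are left untouched.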

Port Registration (RegPort): After a connection is established, the client should send a request to PSS to register all the ports associated with the current node. A port registration request can contain any number of ports to be registered. PSS will return a list of the ports that failed to be registered (because they are already occupied). The caller can choose to subscribe to port release notifications for the ports that failed to be registered.

Each time a message receiving connection is re-established by calling WaitMsg, the client node needs to register all the relevant ports again.
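
A batch registration of this kind might be sketched as follows. The function name `reg_ports` and the plain dictionary used as the port table are assumptions for illustration; the real wire protocol is not specified here.

```python
def reg_ports(owners, node_id, ports):
    """Hypothetical sketch of a batch RegPort call.

    owners maps port name -> owning node id. Each free port is claimed for
    node_id; ports already held by another node are returned together with
    their current owner.
    """
    failed = {}
    for port in ports:
        current = owners.setdefault(port, node_id)  # claims the port if it is free
        if current != node_id:
            failed[port] = current
    return failed
```

Returning the current owner alongside each failed port is a design choice of this sketch; it lets a single call serve both election and discovery.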

Port Un-registration (UnRegPort): Unregisters the ports associated with the current node. A request can contain several ports for batch un-registration. The PSS service maintains a port un-registration notification list for each port under it. This list records the clients that are interested in the port unregistered event. When the port is unregistered (whether caused by an intentional operation or by a failure), the PSS service will follow the list and push the port un-registration notification to the corresponding clients.
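
The per-port notification list can be pictured with the following sketch. The class `PortRegistry`, its `outbox` list standing in for pushed notifications, and the `subscribe` flag are illustrative assumptions, not part of the described service.

```python
from collections import defaultdict


class PortRegistry:
    """Hypothetical sketch of port ownership with un-registration notifications."""

    def __init__(self):
        self.owners = {}                  # port -> owning node id
        self.watchers = defaultdict(set)  # port -> nodes interested in its release
        self.outbox = []                  # (node id, port) notifications "pushed"

    def register(self, node_id, port, subscribe=True):
        """Try to own a port; on failure, optionally watch for its release."""
        owner = self.owners.setdefault(port, node_id)
        if owner != node_id and subscribe:
            self.watchers[port].add(node_id)
        return owner == node_id

    def unregister(self, node_id, ports):
        """Release a batch of ports and notify every watcher of each one."""
        for port in ports:
            if self.owners.get(port) != node_id:
                continue  # only the current owner may release a port
            del self.owners[port]
            for watcher in sorted(self.watchers.pop(port, ())):
                self.outbox.append((watcher, port))
```

Used as a distributed lock, a node whose `register` call fails simply waits for the notification in `outbox` and then tries to register the port again.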

Message Sending (SendMsg): Sends a message (BLOB) to the specified port. The message format is transparent to PSS. If the specified port is an empty string, the message will be broadcast to all nodes within PSS. If the specified port does not exist, the message will be discarded quietly. The client can package multiple message sending commands within a single network request for batch sending. The PSS server will automatically package messages sent to the same node for batch message push.
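
The routing rules for a batch of SendMsg commands can be sketched as below. The function `route_batch` and its arguments are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def route_batch(owners, nodes, batch):
    """Hypothetical sketch: group a batch of (port, payload) messages by node.

    owners maps port -> node id; nodes lists all connected node ids.
    An empty port name means broadcast; unknown ports are dropped silently.
    """
    pushes = defaultdict(list)  # node id -> payloads delivered in one push
    for port, payload in batch:
        if port == "":
            for node_id in nodes:
                pushes[node_id].append(payload)
        elif port in owners:
            pushes[owners[port]].append(payload)
        # otherwise the port does not exist and the message is discarded
    return dict(pushes)
```

Each destination node receives a single list, which corresponds to the server-side batching of messages addressed to the same node.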

Port Query (QueryPort): Queries the node number and network address of the node that currently owns the specified port. This operation is used for service discovery with fault detection. This method is not needed for message sending (SendMsg), because the lookup is executed automatically while delivering a message. A request can contain several ports for batch query.

Node Query (QueryNode): Queries information (e.g. network address) associated with the specified node. This operation is mainly used for node resolution with fault detection. A request can contain several nodes for batch query.

Clear: Executes a clearing operation before a message receiving connection is disconnected, similar to the FIN signal in the four-way handshake of the TCP protocol. If a message receiving connection is disconnected without this primitive having been called successfully, the port switch service will judge it to be an abnormal disconnection; in that case, the ports owned by the client cannot be released immediately and are only released after the node timeout duration of the client has elapsed.

Thus, a port is strictly guaranteed to have at most one owner at any given time (strong consistency). This holds even if the client does not use the TCP protocol to connect to PSS, or if the client connects through intermediate nodes such as a gateway or a proxy.

Preferably, the data of all ports and messages is stored only in the memory of the master node of the PSS server cluster. The PSS master node neither writes port information to disk nor synchronizes the data with other nodes in the PSS server cluster, such as slave nodes and arbiter nodes (single-point full-in-memory mode).

Preferably, the connections between the client cluster and the port switch service include message receiving connections and message sending connections.

Message receiving connection (1:1): It uses the WaitMsg method for node registration and message pushing, keeps occupying all ports belonging to the current node using Relet, and uses the Clear primitive to clean up before normal disconnection. Each node within the cluster should keep one and only one message receiving connection, which is a Keep-Alive connection. It is recommended to always keep the connection active and to complete Relet in a timely manner, because re-establishing a receiving connection requires service election (port registration) to be performed again.

Message sending connection (1:N): All connections that are not upgraded using the WaitMsg API are deemed sending connections. They use primitives such as RegPort, UnRegPort, SendMsg and QueryPort for non-pushing requests, without needing Relet to keep a heartbeat or the Clear command to clean up. Each node within the cluster maintains a message sending connection pool, so that worker threads can stay in communication with the port switch service.

A horizontal scaling (scale-out) mode of the port switch server cluster is shown in FIG. 3. In a cascade deployment, the leaf nodes in the tree-structured PSS server clusters serve their respective client clusters and supply distributed coordination service to them. These leaf clusters are in charge of processing all local requests and escalate all requests exceeding the local strategy range to higher-level server clusters, until the requests can be processed and the result is returned back down level by level (the result can be cached at each level to improve efficiency).

The strategy range is limited by the name space: one client node can only register ports under its local name space and its superior name spaces, and cannot register ports under a sibling name space or a collateral name space. Message sending is not limited: one client node can send messages to any port and node in the system.
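
The name space rule can be expressed as a simple ancestor check. The sketch below assumes, purely for illustration, that name spaces are written as slash-separated paths such as `global/apac/beijing`; the document does not specify a naming syntax, and `may_register` is a hypothetical helper.

```python
def may_register(client_ns, port_ns):
    """Hypothetical check: True if port_ns is the client's own name space
    or one of its superior (ancestor) name spaces."""
    client = client_ns.strip("/").split("/")
    port = port_ns.strip("/").split("/")
    return len(port) <= len(client) and client[:len(port)] == port
```

Under this assumption, a client in `global/apac/beijing` may register ports in `global/apac/beijing`, `global/apac` and `global`, but not in `global/apac/shanghai` or `global/na`.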

Since, in practice, most of the requests sent by PSS client nodes are local requests (only the local PSS cluster is involved), this cascading mode not only realizes horizontal scaling efficiently, but can also be used for deploying long-distance offsite clusters across different regions. In this case, communication across regions is costly, and its cost can be effectively reduced by deploying a set of leaf clusters for each region (all the leaf clusters are uniformly connected to superior clusters at different levels).

Referring to FIG. 4, the PSS server is formed by clusters in a three-level cascading structure, wherein the top-level cluster is in charge of port change operations (registration, unregistration, etc.) and message forwarding across large areas (Asia-Pacific area, North America area, etc.) in the global name space.

The second level in the cascading structure corresponds to the various large areas, such as the Asia-Pacific area and the North America area. A corresponding PSS server cluster is in charge of each large area, and each cluster can process port changes in its own large area and message forwarding requests among the regions in that large area. These clusters connect upward to the top-level cluster and downward supply service to the PSS clusters of the different regions in the large area.

The third level in the cascading structure corresponds to the various regions in each large area, such as the Shanghai region, the Beijing region, and the San Francisco region. One leaf-level PSS server cluster is in charge of managing each region. Port change and message forwarding requests within a region can be resolved by the corresponding leaf PSS server cluster without involving the upper-level clusters. Only requests exceeding the local range need to be escalated to the upper-level cluster for processing. For example, message switching and port registration requests within Beijing can be processed by the leaf PSS server cluster in Beijing; a message sent by a Beijing node to a Shanghai node needs to be relayed by the Asia-Pacific cluster; and a message sent by a Beijing node to a San Francisco node needs to be relayed through the Asia-Pacific area cluster, the top-level cluster, the North America area cluster, and so on.
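
The escalation path in the examples above can be computed by climbing from the source leaf cluster to the lowest common ancestor and then descending to the destination. The sketch assumes slash-separated cluster paths under a single root (for example `global/apac/beijing`), which is an illustrative convention and not something the document prescribes; `relay_path` is a hypothetical helper.

```python
def relay_path(src_ns, dst_ns):
    """Hypothetical sketch: clusters a message passes through, in order."""
    src = src_ns.strip("/").split("/")
    dst = dst_ns.strip("/").split("/")
    common = 0
    while common < min(len(src), len(dst)) and src[common] == dst[common]:
        common += 1
    if common == 0:
        raise ValueError("source and destination share no top-level cluster")
    # Climb from the source leaf up to (and including) the common ancestor.
    up = ["/".join(src[:i]) for i in range(len(src), common - 1, -1)]
    # Descend from just below the common ancestor to the destination leaf.
    down = ["/".join(dst[:i]) for i in range(common + 1, len(dst) + 1)]
    return up + down
```

With these assumed names, a Beijing to Shanghai message passes through the Beijing leaf, the Asia-Pacific cluster and the Shanghai leaf, while a Beijing to San Francisco message additionally crosses the top-level and North America clusters.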

Correspondingly, the client nodes in Beijing can register ports in the name spaces belonging to Beijing, the Asia-Pacific area and the global area (top level), but cannot register ports in the name spaces of Shanghai, North America, Vancouver, etc. (Note: the descriptions for FIG. 4 are all examples; division rules with a cascading structure of any number of levels and any regions can be used as needed in practice.)

As can be seen from the above, the present invention has the following characteristics:

Availability: High availability is ensured by completing fault detection and master/slave switching within two seconds; a quorum-based election algorithm avoids split brain due to network partition.

Consistency: A port can be owned by only one client node at any given time. It is impossible for multiple nodes to succeed in registering and occupying the same port simultaneously.

Reliability: All messages sent to an unregistered port (the port does not exist, has been unregistered, or has expired) are discarded quietly.

The system ensures that all messages sent to registered ports are in order and unique, but messages may get lost in some extreme conditions:

Master/slave switching because the port switch service is unavailable: All messages queued to be forwarded will be lost. All the already registered nodes need to register again, and all the already registered ports (services and locks) need to be elected/acquired again (re-registered).

A node's receiving connection is recovered from disconnection: After the message receiving connection has been disconnected and then re-connected, all the ports that were registered for this node become invalid and need to be registered again. During the time frame from disconnection to re-connection, all messages sent to ports that were bound to this node and have not been registered by any other node will be discarded.

Each time the PSS master node goes offline due to a failure, all registered ports forcibly become invalid, and all active ports need to be registered again.

For example, if a distributed Web server cluster treats a user as the minimum scheduling unit and registers a message port for each logged-in user, then after the master node of PSS goes offline due to a failure, each node will know that all the ports it maintains have become invalid, and it needs to register all active (online) users again with the new PSS master.

This may seem to cause fluctuations in system performance, but it does not matter: the operation can be completed in a batch. Through the batch registration interface, a single request can register or unregister as many as millions of ports simultaneously, improving request processing efficiency and network utilization. On a Xeon processor (Haswell, 2.0 GHz) released in 2013, PSS is able to achieve a speed of 1 million ports per second per core (per thread). Thanks to the concurrent hash table that we developed (each arena has its own full user-mode reader/writer lock optimized in assembly), performance can be scaled linearly by simply increasing the number of processor cores.

Specifically, in an environment with a 4-core CPU and a Gigabit network adapter, PSS is capable of registering 4 million ports per second. In an environment with a 48-core CPU and a 10 G network adapter, PSS is able to support registering nearly 40 million ports per second (the name length of each port is 16 bytes), almost reaching the limit for both throughput and payload ratio. There is almost no impact on system performance, because the above scenarios rarely happen and re-registration can be done progressively as objects are loaded.

To illustrate this, consider the extreme condition in which one billion users are online simultaneously. Although applications register a dedicated port (for determining user ownership and for message distribution) for each of the users, it is impossible that all one billion users will press the refresh button simultaneously during the first second after recovery from the fault. Instead, these online users will usually return to the server after minutes, hours or longer, which is determined by the intrinsic characteristics of Web applications (total number of online users = number of concurrent requests per second × average user think time). Even if we suppose that all these users return within one minute (an average think time of one minute), which is a relatively tough situation, PSS only needs to process about 16 million registration requests per second, which means a 1 U PC server with a 16-core Haswell processor and a 10 G network adapter is enough to satisfy the requirements.
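
The estimate above follows directly from the stated relation between online users, request rate and think time. The short calculation below reproduces it; the function name `required_rate` is introduced here for illustration, and the per-core figure of 1 million ports per second is the one quoted earlier in this description.

```python
def required_rate(online_users, think_time_s):
    """Requests per second implied by: online users = rate x think time."""
    return online_users / think_time_s


# One billion online users all returning within a one-minute think time.
rate = required_rate(1_000_000_000, 60)

# Core-equivalents needed at the quoted 1 million registrations per second per core.
cores_needed = rate / 1_000_000
```

The result is roughly 16.7 million registrations per second, or between 16 and 17 core-equivalents at the quoted per-core rate, which agrees with the rounded figures given above.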

As a real example, official statistics show that there were 180 million daily active users (DAU) on Taobao.com on November 11 (“Double 11”), 2015, and the maximum number of concurrent online users was 45 million. We can conclude that the current peak number of concurrent users for huge sites is far less than the above-mentioned extreme condition. PSS is able to support this with ease even if we increase this number by tens of times.

The following table compares the characteristics of PSS with those of some distributed coordination products that utilize traditional consensus algorithms such as Paxos and Raft:

| Item | PSS | ZooKeeper, Consul, etcd . . . |
|---|---|---|
| Availability | High availability; supports multiple active IDC. | High availability; supports multiple active IDC. |
| Consistency | Strong consistency; the master node is elected by the quorum. | Strong consistency; multi-replica. |
| Concurrency | Tens of millions of concurrent connections; hundreds of thousands of concurrent nodes. | Up to 5,000 nodes. |
| Capacity | Each 10 GB of memory can hold about 100 million message ports; each 1 TB of memory can hold about ten billion message ports; the two-level concurrent hash table structure allows capacity to be linearly extended to PB level. | Usually supports up to tens of thousands of key-value pairs; this number is even smaller when change notification is enabled. |
| Delay | The delay per request within the same IDC is at sub-millisecond level (0.5 ms in Aliyun.com); the delay per request for different IDCs within the same region is at millisecond level (2 ms in Aliyun.com). | Because each request requires three rounds of network broadcasting and multiple disk I/O operations, the delay per operation within the same IDC is over 10 milliseconds; the delay per request for different IDCs is even longer (see the following paragraphs). |
| Performance | Each 1 Gbps of bandwidth can support nearly 4 million port registration and unregistration operations per second. On an entry-level Haswell processor (2013), each core can support 1 million of the above-mentioned operations per second. The performance can be linearly extended by increasing bandwidth and processor cores. | The characteristics of the algorithm itself make it impossible to support batch operations; less than 100 requests per second. (Because each atomic operation requires three rounds of network broadcasting and multiple disk I/O operations, it is meaningless to add batch operation support.) |
| Network utilization | High network utilization: both the server and client have batch packing capabilities for port registration, port unregistration, port query, node query and message sending; the network payload ratio can be close to 100%. | Low network utilization: each request uses a separate package (TCP segment, IP packet, network frame); the network payload ratio is typically less than 5%. |
| Scalability | Yes: can achieve horizontal scaling in cascading style. | No: the more nodes the cluster contains (the wider the range for broadcasting and disk I/O operations becomes), the worse the performance is. |
| Partition tolerance | The system goes offline when there is no quorum partition, but a broadcast storm will not occur. | The system goes offline when there is no quorum partition. It is possible to produce a broadcast storm that aggravates the network failure. |
| Message dispatching | Yes, and with high performance: both the server and client support automatic message batching. | None. |
| Configuration management | No: PSS believes the configuration data should be managed by dedicated products like Redis, MySQL, MongoDB, etc. Of course, the distributed coordination tasks of these CMDB products (e.g. master election) can still be done by PSS. | Yes: can be used as a simple CMDB. This confusion of functions and responsibilities makes capacity and performance worse. |
| Fault recovery | Needs to re-generate a state machine, which can be completed at tens of millions or hundreds of millions of ports per second; practically, this has no impact on performance. | There is no need to re-generate a state machine. |

In the above comparisons, delay and performance mainly refer to write operations. This is because almost all of the meaningful operations associated with typical distributed coordination tasks are write operations:

| Operations | From service coordination perspective | From distributed lock perspective |
|---|---|---|
| Port registration | Success: service election succeeded; becomes the owner of the service. Failed: successfully discovers the current owner of the service. | Success: lock acquired successfully. Failed: failed to acquire the lock, returning the current lock owner. |
| Port unregistration | Releases service ownership. | Releases lock. |
| Unregistration notification | The service has gone offline; can update local query cache or participate in service election. | Lock is released; can attempt to acquire the lock again. |

As shown in the above table, port registration in PSS corresponds to “write/create KV pair” in traditional distributed coordination products. Port unregistration corresponds to “delete KV pair”, and the unregistration notification corresponds to “change notification”.

To achieve maximum performance, we do not use read-only operations such as queries in production environments. Instead, we hide query operations in write requests such as port registration. If the request is successful, the current node becomes the owner. If registration fails, the current owner of the requested service is returned. This also completes read operations such as owner query (service discovery/name resolution).
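
On the client side, this pattern might look like the following sketch, in which service discovery is answered by a registration attempt and cached until an unregistration notification arrives. The class `ServiceLocator` and the `reg_port` callable (taking a node id and a port and returning the owner after the attempt) are hypothetical stand-ins, not the PSS client API.

```python
class ServiceLocator:
    """Hypothetical client-side sketch: discovery hidden inside registration."""

    def __init__(self, node_id, reg_port):
        self.node_id = node_id
        self.reg_port = reg_port  # assumed callable: (node_id, port) -> owner id
        self.cache = {}           # service port -> last known owner

    def owner_of(self, service):
        """Return the owner of a service, registering for it if unknown."""
        if service not in self.cache:
            # A single write attempt either elects this node or reveals the owner.
            self.cache[service] = self.reg_port(self.node_id, service)
        return self.cache[service]

    def on_unregistered(self, service):
        """Unregistration notification: forget the cached owner."""
        self.cache.pop(service, None)
```

After `on_unregistered` is called, the next `owner_of` call takes part in the election again, which matches the "update local query cache or participate in service election" row of the table above.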

Even when a write operation (e.g., port registration) fails, it is still accompanied by a successful write operation. The reason is that the node that initiated the request needs to be added to the change notification list of the specified item, so that notification messages can be pushed to all interested nodes when a change such as port unregistration happens. The differences in write performance therefore greatly affect the performance of an actual application.

From the high-performance cluster (HPC) perspective, as mentioned above, the biggest differences between PSS and the traditional distributed coordination products described above are mainly reflected in the following two aspects:

1. High performance: PSS eliminates the overhead of network broadcasting and disk IO, and adds batch operation support and other optimizations. As a result, the overall performance of the distributed coordination service is increased by tens of thousands of times (see sections [0006], [0032]-[0037], and [0060]).
2. High capacity: about 100 million message ports per 10 GB of memory. Owing to the rational use of data structures such as the concurrent hash table, the capacity and processing performance can be scaled linearly with memory capacity, the number of processor cores, network card speed and other hardware upgrades (see section [0060]).

Due to the performance and capacity limitations of traditionaldistributed coordination services, in a classical distributed cluster,the distributed coordination and scheduling unit is typically at theservice or node level. At the same time, the nodes in the cluster arerequired to operate in stateless mode as far as possible. The design ofservice node stateless has low requirement on distributed coordinationservice, but also brings the problem of low overall performance and soon.

PSS, on the other hand, can easily achieve the processing performance oftens of millions of requests per second, and tens of billions tohundreds of billions of message ports capacity. This provides a goodfoundation for the fine coordination of distributed clusters. Comparedwith the traditional stateless cluster, PSS-based fine collaborativeclusters can bring a huge overall performance improvement (See section[0006] and [0060]).

User and session management is the most common feature in almost allnetwork applications. We first take it as an example: In a statelesscluster, the online user does not have its owner server. Each time auser request arrives, it is routed randomly by the reverse proxy serviceto any node in the backend cluster. Although LVS, Nginx, HAProxy, TS andother mainstream reverse proxy server support node stickiness optionsbased on Cookie or IP, but because the nodes in the cluster arestateless, so the mechanism simply increases the probability thatrequests from the same client will be routed to a certain backend servernode and still cannot provide a guarantee of ownership. Therefore, itwill not be possible to achieve further optimizations.

Benefiting from the outstanding performance and capacity of PSS, clusters based on PSS can be coordinated and scheduled at the user level (i.e., by registering one port for each active user) to provide better overall performance. The implementation steps are:

-   -   1. As with the traditional approach, when a user request arrives at the reverse proxy service, the reverse proxy determines which back-end server node the request should be forwarded to from the HTTP cookie, the IP address, or related fields in the custom protocol. If the request carries no sticky tag, the lowest-load node in the current back-end cluster is selected to process the request.
    -   2. After receiving the user request, the server node checks whether it is the owner of the requesting user by looking in its local memory table.
        -   a) If the current node is already the owner of the user, the node continues processing the user request.
        -   b) If the current node is not the owner of the user, it initiates a RegPort request to PSS, attempting to become the owner of the user. This request should be initiated in batch mode to further improve network utilization and processing efficiency.
            -   i. If the RegPort request succeeds, the current node has acquired ownership of the user. The user information can then be loaded from the backend database into the local cache of the current node (which should be optimized using bulk load), and the node continues processing the user request.
            -   ii. If the RegPort request fails, ownership of the specified user currently belongs to another node. In this case, the sticky field that the reverse proxy can recognize, such as a cookie, should be reset to point to the correct owner node, and the reverse proxy service or the client should then be notified to retry.
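The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the PSS wire protocol: `FakePSS`, `reg_ports`, `load_user` and the `user:<id>` port naming are assumptions made for the example, and the in-memory stub simply grants a port to the first node that asks for it.

```python
from dataclasses import dataclass, field


class FakePSS:
    """In-memory stand-in for PSS (hypothetical; not the real protocol)."""

    def __init__(self):
        self.owners = {}  # port name -> owning node ID

    def reg_ports(self, node_id, names):
        """Batch RegPort: claim each free port and report the actual owner of each."""
        return {name: self.owners.setdefault(name, node_id) for name in names}


def load_user(user_id):
    """Placeholder for loading the user record from the backend database."""
    return {"id": user_id}


@dataclass
class ServerNode:
    node_id: str
    pss: FakePSS
    users: dict = field(default_factory=dict)  # local memory table of owned users

    def handle(self, user_id):
        if user_id in self.users:                  # step 2a: already the owner
            return ("served", self.node_id)
        port = f"user:{user_id}"
        owner = self.pss.reg_ports(self.node_id, [port])[port]  # step 2b
        if owner == self.node_id:                  # step 2b-i: ownership acquired
            self.users[user_id] = load_user(user_id)
            return ("served", self.node_id)
        return ("redirect", owner)                 # step 2b-ii: reset sticky field, retry


pss = FakePSS()
node_a, node_b = ServerNode("A", pss), ServerNode("B", pss)
print(node_a.handle(42))  # ('served', 'A')
print(node_b.handle(42))  # ('redirect', 'A')
```

A production node would queue pending registrations and send them in one batched request, as the text recommends; the stub accepts a list of names for that reason.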

Stateless services in traditional architectures also need MySQL, Memcached, Redis or similar technologies to implement the user and session management mechanism, so the above implementation does not add much complexity, yet the performance improvement is very large, as follows:

Item 1 (Op.)

PSS HPC: Eliminates the deployment and maintenance costs of the user and session management cluster.

Traditional Stateless Cluster: The user management cluster must be implemented and maintained separately, with dedicated high-availability protection for the user and session management service. This increases the number of fault points, the overall system complexity and the maintenance costs.

Item 2 (Net.)

PSS HPC: Nearly all user matching and session verification tasks for a client request can be done directly in the memory of its owner node. Memory access is a nanosecond-level operation; compared with a millisecond-level network query delay, this is a performance increase of more than 100,000 times. It also effectively reduces the network load in the server cluster.

Traditional Stateless Cluster: A query request must be sent over the network to the user and session management service, and its result awaited, each time user identity and session validity need to be checked. Network load and latency are high. Because in a typical network application most user requests must first complete user identification and session authentication before processing can continue, this has a great impact on overall performance.

Item 3 (Cch.)

PSS HPC: Each active user has a definite owner server at any given time, and a user tends to access the same or similar data repeatedly over a certain period of time (such as their own properties, or the product information they have just submitted or viewed). As a result, the server's local data caches tend to have high locality and high hit rates. Compared with distributed caching, the advantages of a local cache are obvious:

1. It eliminates the network latency required by query requests and reduces network load (see “Item 2” in this table for details).
2. It obtains the expanded data structures directly from memory, without a lot of data serialization and deserialization work.

The server's local cache hit rate can be further improved if appropriate rules for user owner selection are followed, for example:

a) Group users by tenant (company, department, site);
b) Group users by region (geographical location, map area in a game);
c) Group users by interest characteristics (game team, product preference).

And so on, then try to assign users belonging to the same group to the same server node (or to the same set of nodes). Choosing an appropriate user grouping strategy can greatly enhance the server node's local cache hit rate. This allows most of the data associated with a user or a group of users to be cached locally. This not only improves the overall performance of the cluster, but also eliminates the dependency on the distributed cache. The read pressure on the backend database is also greatly reduced.

Traditional Stateless Cluster: There is no dedicated owner server, so user requests can be randomly dispatched to any node in the server cluster. The local cache hit rate is low, more content is cached repeatedly in different nodes, and the cluster must rely on the distributed cache at a higher cost. The read pressure on the backend database server is high. Additional optimizations are required, such as horizontal partitioning, vertical partitioning, and read/write separation.

Item 4 (Upd.)

PSS HPC: Owing to the deterministic ownership solution, any user is guaranteed to be serviced globally by one particular owner node within a given time period in the cluster. In addition, the probability of a sudden failure of a modern PC server is very low. Thus, frequently changing user properties with lower importance or timeliness requirements can be cached in memory. The owner node can accumulate these changes for a period of time and then write them to the database in batches. This can greatly reduce the write pressure on the backend database.

For example, a shop system may collect and record user preference information in real time as the user browses (e.g., views each product item). The workload is high if the system must update the database immediately each time a user views a new product. It is also perfectly acceptable for some users to occasionally lose their last few hours of product browsing preference data because of a hardware failure. Thus, the changed data can be temporarily stored in the local cache of the owner node, and the database is updated in batches every few hours.

Another example: in an MMORPG, the user's current location, status, experience and other data values change constantly. The owner server can likewise accumulate these data changes in the local cache and write them to the database in batches at appropriate intervals (e.g., every 5 minutes).

This not only significantly reduces the number of requests executed by the backend database, but also eliminates a significant amount of disk flushing by encapsulating multiple user data update requests into a single batch transaction, resulting in further efficiency improvements. In addition, updating user properties through a dedicated owner node avoids the contention that arises when multiple nodes simultaneously update the same object in a stateless cluster. This further improves database performance.

Traditional Stateless Cluster: Cumulative write optimization and batch write optimization cannot be implemented, because each request from the user may be forwarded to a different server node for processing. The write pressure on the backend database is very high. A plurality of nodes may compete to update the same record simultaneously, further increasing the burden on the database. Additional optimizations are required, such as horizontal partitioning and vertical partitioning. However, these optimizations also result in side effects such as “need to implement distributed transaction support at the application layer.”

Item 5 (Push)

PSS HPC: Since all sessions initiated by the same user are managed centrally in the same owner node, it is very convenient to push an instant notification message (Comet) to the user. If the object sending the message is on the same node as the recipient, the message can be pushed directly to all active sessions belonging to the recipient. Otherwise, the message simply needs to be delivered to the owner node of the recipient. Message delivery can be implemented using PSS (by sending messages directly to the corresponding port of the recipient; the batch message sending mechanism should be enabled as an optimization). It can also be done through dedicated message middleware (e.g., Kafka, RocketMQ, RabbitMQ, ZeroMQ, etc.).

If user ownership is grouped as described in item 3 of this table, the probability of completing the message push within the same node can be greatly improved. This can significantly reduce the communication between servers. Therefore, we encourage customizing the user grouping strategy properly based on the actual business situation. A reasonable grouping strategy can achieve the desired effect, namely that most message pushes occur directly in the current server node.

For example, for a game application, group players by map object and place players within the same map instance on the same owner node: most message pushes in a traditional MMORPG occur between players within the same map instance (AOI).

Another example: for CRM, HCM, ERP and other SaaS applications, users can be grouped by company, placing users who belong to the same enterprise on the same owner node. For such enterprise applications, nearly 100% of the communications take place among members of the same enterprise. The result is a local message push rate of nearly 100%: message delivery between servers can almost be eliminated. This significantly reduces the internal network load of the server cluster.

Traditional Stateless Cluster: Because different sessions of the same user are randomly assigned to different nodes, a specialized message push cluster must be developed, deployed, and maintained. It must also be specifically designed to ensure the high performance and high availability of the cluster. This not only increases the development and maintenance costs, but also increases the internal network load of the server cluster, because each message must be forwarded to the push service before it can be sent to the client. The processing latency of the user request is also increased.

Item 6 (Bal.)

PSS HPC: Clusters can be scheduled using a combination of active and passive load balancing.

Passive balancing: Each node in the cluster periodically unloads users and sessions that are no longer active, and notifies the PSS service to bulk release the corresponding ports for those users. This algorithm implements macro load balancing (in the long term, clusters are balanced).

Active balancing: The cluster selects the load balancing coordinator node through the PSS service. This node continuously monitors the load of each node in the cluster and sends instructions for load scheduling (e.g., requesting node A to transfer 5,000 users owned by it to node B). Unlike passive balancing at the macro level, the active balancing mechanism can act within a shorter time slice and responds more quickly.

Active balancing is usually effective when some of the nodes in the cluster have just recovered from a failure (and are therefore in a no-load state); it reacts more rapidly than passive balancing. For example: in a cluster that spans multiple active IDCs, an IDC comes back online when a cable fault has just been repaired.

Traditional Stateless Cluster: If the node stickiness option is enabled in the reverse proxy, its load balancing is comparable to the PSS cluster's passive balancing algorithm. If the node stickiness option in the reverse proxy is not enabled, its balance is worse than that of a PSS actively balanced cluster when recovering from a failure. At the same time, to ensure that the local cache hit rate and other performance indicators are not too bad, the administrator usually does not disable node stickiness. In addition, an SOA architecture tends toward imbalance between multiple services, leaving some services overloaded and others lightly loaded; a μSOA cluster has no such shortcomings.
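The cumulative and batch write optimization described in item 4 can be sketched as a small write-behind cache held by the owner node. The class and the `flush_fn` callback below are hypothetical names chosen for illustration; a real owner node would call `flush` from a timer (every few minutes or hours, as in the examples above) and would wrap each batch in a single database transaction.

```python
from collections import defaultdict


class WriteBehindCache:
    """Accumulates property changes in memory and writes them out in batches."""

    def __init__(self, flush_fn):
        self.dirty = defaultdict(dict)  # user ID -> {property: latest value}
        self.flush_fn = flush_fn        # hypothetical: persists one batch

    def update(self, user_id, **fields):
        # Later writes to the same property overwrite earlier ones,
        # so only the latest value ever reaches the database.
        self.dirty[user_id].update(fields)

    def flush(self):
        """Hand all accumulated changes to flush_fn; return the number of users written."""
        if not self.dirty:
            return 0
        batch, self.dirty = dict(self.dirty), defaultdict(dict)
        self.flush_fn(batch)
        return len(batch)


batches = []  # stands in for the backend database
cache = WriteBehindCache(batches.append)
cache.update(1, x=10, y=20)
cache.update(1, x=11)       # overwrites x for user 1
cache.update(2, level=5)
print(cache.flush())        # 2
print(batches)              # [{1: {'x': 11, 'y': 20}, 2: {'level': 5}}]
```

Three update calls collapse into one batch of two records. Changes that have not yet been flushed are lost if the owner node fails, which is the trade-off the table accepts for low-importance properties.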

It is worth mentioning that such a precise collaborative algorithm does not cause any loss in availability of the cluster. Consider the case where a node in the cluster goes down due to a failure: the PSS service detects that the node is offline and automatically releases all users belonging to that node. When one of those users initiates a new request to the cluster, the request is routed to the most lightly loaded node in the current cluster, which then acquires ownership of the user (see step 2-b-i above). This process is transparent to the user and does not require additional processing logic in the client.
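The failover path can be illustrated with a toy model. The names `FakePSS`, `node_offline` and `lightest_node` are assumptions made for this sketch; how PSS actually detects the failure (for instance, an expired lease) is not modeled here.

```python
class FakePSS:
    """In-memory stand-in for PSS (hypothetical; not the real protocol)."""

    def __init__(self):
        self.owners = {}  # port name -> owning node ID

    def reg_port(self, node_id, name):
        """Claim the port if it is free and return its actual owner."""
        return self.owners.setdefault(name, node_id)

    def node_offline(self, node_id):
        """Release every port held by a failed node and return their names."""
        released = [name for name, owner in self.owners.items() if owner == node_id]
        for name in released:
            del self.owners[name]
        return released


def lightest_node(load):
    """Pick the node with the lowest load, as the reverse proxy does in step 1."""
    return min(load, key=load.get)


pss = FakePSS()
pss.reg_port("A", "user:1")
pss.reg_port("A", "user:2")
pss.reg_port("B", "user:3")

print(pss.node_offline("A"))            # ['user:1', 'user:2']
target = lightest_node({"B": 1, "C": 0})
print(pss.reg_port(target, "user:1"))   # C
```

After node A fails, the next request for user 1 lands on the least loaded surviving node, which registers the port and becomes the new owner without any client-side logic.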

The above discussion shows the advantages of the fine coordination capability of a PSS HPC cluster, taking the user and session management functions that are involved in almost all network applications as an example. In most real-world situations, however, an application does not consist of user management alone; it usually also includes other objects that its users can manipulate. For example, on Youku.com, tudou.com, youtube.com and other video sites, there are, in addition to users, at least “video objects” that users can play.

Here we take the “video object” as an example to explore how to use the fine scheduling capabilities of PSS to significantly enhance cluster performance.

In this hypothetical video-on-demand application, similar to the user management function described above, we first select an owner node for each active video object through the PSS service. Secondly, we divide the properties of a video object into the following two categories:

-   -   1. Common Properties: properties that are rarely updated and small in size, such as the video title, video introduction, video tags, video author UID, video publication time, the ID of the video stream data stored in the object storage service (S3/OSS), and the like. These properties all follow the “read more, write less” pattern; indeed, most of these fields cannot be modified after the video is published.
        -   Such small, rarely changed fields can be distributed in the local cache of each server node in the current cluster. Local memory caches offer high performance, low latency, and no need for serialization, and the cached objects are small. Combined with strategies that further enhance cache locality, such as user ownership grouping, overall performance can be improved effectively at a reasonable memory overhead (see below).
    -   2. Dynamic Properties: all properties that change frequently or are large in size, such as the video play count, “like” and “dislike” counts, scores, number of favorites, number of comments, the contents of the discussion forum belonging to the video, and so on.
        -   We stipulate that such fields can only be accessed by the owner of the video object. Other nodes need to send a request to the corresponding owner to access these dynamic properties.
        -   This means that we use the election mechanism provided by PSS to hand over properties that change frequently (requiring database updates and cache invalidation) or require more memory (high cache cost) to the appropriate owner node for management and maintenance. This results in a highly efficient distributed computing and distributed caching mechanism, greatly improving the overall performance of the application (see below).

In addition, we also stipulate that any write operation to the video object (whether for common or dynamic properties) must be done by its owner. A non-owner node can only read and cache the common properties of a video object; it cannot read dynamic properties and cannot perform any update operations.
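The property split and the access rule can be expressed compactly. In the sketch below, `VideoObject`, `NotOwner`, `read` and `write` are illustrative names, and a single in-memory object stands in for the per-node caches; the point is only to show which node may touch which category of property.

```python
from dataclasses import dataclass, field


class NotOwner(Exception):
    """Raised when a non-owner touches owner-only data; args[0] is the owner ID."""


@dataclass
class VideoObject:
    video_id: str
    owner: str                                   # node that holds the PSS port
    common: dict = field(default_factory=dict)   # title, author UID, ... (rarely changes)
    dynamic: dict = field(default_factory=dict)  # play count, likes, ... (owner only)


def read(video, node_id, key):
    if key in video.common:
        return video.common[key]       # any node may read and cache common properties
    if node_id != video.owner:
        raise NotOwner(video.owner)    # caller should forward the request to the owner
    return video.dynamic[key]


def write(video, node_id, key, value):
    if node_id != video.owner:         # all writes go through the owner
        raise NotOwner(video.owner)
    target = video.common if key in video.common else video.dynamic
    target[key] = value


video = VideoObject("v1", owner="A", common={"title": "Demo"}, dynamic={"plays": 3})
print(read(video, "B", "title"))   # Demo
write(video, "A", "plays", 4)
print(read(video, "A", "plays"))   # 4
```

A non-owner that receives `NotOwner` learns the owner's node ID from the exception and can forward the request there.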

Therefore, we can simply infer that the general logic of accessing a video object is as follows:

-   -   1. When a common property read request arrives at a server node, the local cache is checked. On a cache hit, the result is returned directly. Otherwise, the common part of the video object is read from the backend database and added to the local cache of the current node.
    -   2. When an update request or a dynamic property read request arrives, the node checks, through its local memory table, whether it is the owner of the corresponding video object.
        -   a) If the current node is already the owner of the video, it continues to process the user request. For read operations, the result is returned directly from the local cache of the current node; depending on the situation, write operations are either accumulated in the local cache or passed directly to the backend database (in which case the local cache is updated at the same time).
        -   b) If the current node is not the owner of the video but finds an entry matching the video in its local name resolution cache table, it forwards the current request to the corresponding owner node.
        -   c) If the current node is not the owner of the video and does not find the corresponding entry in its local name resolution cache table, it initiates a RegPort request to PSS and tries to become the owner of the video. This request should be initiated in batch mode to further improve network utilization and processing efficiency.
            -   i. If the RegPort request succeeds, the current node has acquired ownership of the video. At this point, the video information can be loaded from the backend database into the local cache of the current node (which should be optimized using bulk loading), and the node continues processing the request.
            -   ii. If the RegPort request fails, the specified video object is already owned by another node. In this case, the video and its corresponding owner ID are added to the local name resolution cache table, and the request is forwarded to the corresponding owner node for processing.
                -   Note: Because PSS pushes a notification to all nodes interested in the event each time a port is unregistered (whether through explicit ownership release or because the node failed and went offline), the name resolution cache table does not require a TTL timeout mechanism like that of a DNS cache. An entry only needs to be deleted when the port deregistration notice is received or when the LRU cache is full. This not only improves the timeliness and accuracy of entries in the lookup table, but also effectively reduces the number of RegPort requests that need to be sent, improving the overall performance of the application.
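Step 2 of the logic above, including the name resolution cache, can be sketched as follows. `FakePSS`, `VideoNode`, `route` and `on_port_unregistered` are hypothetical names for this illustration; the stub grants a port to the first node that asks, and the deregistration push from PSS is simulated by calling the handler directly.

```python
from collections import OrderedDict


class FakePSS:
    """In-memory stand-in for PSS (hypothetical; not the real protocol)."""

    def __init__(self):
        self.owners = {}  # port name -> owning node ID

    def reg_port(self, node_id, name):
        """Claim the port if it is free and return its actual owner."""
        return self.owners.setdefault(name, node_id)


class VideoNode:
    def __init__(self, node_id, pss, cache_size=1024):
        self.node_id, self.pss = node_id, pss
        self.owned = {}                # local memory table: videos this node owns
        self.resolved = OrderedDict()  # name resolution cache: video -> owner (LRU order)
        self.cache_size = cache_size

    def route(self, video_id):
        """Return the node that must process an update or dynamic property read."""
        if video_id in self.owned:                 # step 2a: this node is the owner
            return self.node_id
        if video_id in self.resolved:              # step 2b: owner already known
            self.resolved.move_to_end(video_id)
            return self.resolved[video_id]
        owner = self.pss.reg_port(self.node_id, f"video:{video_id}")  # step 2c
        if owner == self.node_id:                  # step 2c-i: ownership acquired
            self.owned[video_id] = {}              # would be bulk-loaded from the database
            return self.node_id
        self.resolved[video_id] = owner            # step 2c-ii: remember the owner
        if len(self.resolved) > self.cache_size:
            self.resolved.popitem(last=False)      # evict the least recently used entry
        return owner

    def on_port_unregistered(self, video_id):
        """Handle the PSS deregistration notice: drop the stale entry (no TTL needed)."""
        self.resolved.pop(video_id, None)


pss = FakePSS()
node_a, node_b = VideoNode("A", pss), VideoNode("B", pss)
print(node_a.route("v1"))  # A
print(node_b.route("v1"))  # A
```

Once node B has cached the owner of `v1`, later requests are forwarded without contacting PSS again until a deregistration notice removes the entry.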

Compared with the classic stateless SOA cluster, the benefits of the above design are as follows:

Item 1 (Op.)

PSS HPC: The distributed cache structure is based on ownership, which eliminates the deployment and maintenance costs of distributed cache clusters such as Memcached and Redis.

Traditional Stateless Cluster: Distributed cache clusters need to be implemented and maintained separately, which increases overall system complexity.

Item 2 (Cch.)

PSS HPC: A common property read operation will hit the local cache. If the owner node selection strategy “Group users according to their preference characteristics” is used, cache locality is greatly enhanced: the local cache hit rate increases and the rate of repeated caching across the nodes of the cluster decreases.

As mentioned earlier, compared with a distributed cache, the local cache eliminates network latency, reduces network load, avoids frequent serialization and deserialization of data structures, and so on.

In addition, dynamic properties are implemented using an ownership-based distributed cache, which avoids the frequent invalidation and data inconsistency problems of traditional distributed caches. At the same time, because dynamic properties are cached only on the owner node, the overall memory utilization of the system is also significantly improved.

Traditional Stateless Cluster: There is no dedicated owner server, so user requests can be randomly dispatched to any node in the server cluster. The local cache hit rate is low, more content is cached repeatedly in different nodes, and the cluster must rely on the distributed cache at a higher cost. The read pressure on the backend database server is high. Additional optimizations are required, such as horizontal partitioning, vertical partitioning, and read/write separation.

Furthermore, even if CAS atomic operations based on a Revision field and other similar improvements are added to Memcached, Redis and other products, these independent distributed cache clusters still do not provide strong consistency guarantees (i.e., the data in the cache may not be consistent with the records in the backend database).

Item 3 (Upd.)

PSS HPC: Owing to the deterministic ownership solution, all write operations and dynamic property read operations on video objects are guaranteed to be serviced globally by one particular owner node within a given time period in the cluster. In addition, the probability of a sudden failure of a modern PC server is very low. Thus, frequently changing dynamic properties with lower importance or timeliness requirements can be cached in memory. The owner node can accumulate these changes for a period of time and then write them to the database in batches. This can greatly reduce the write pressure on the backend database.

For example: the video play count, “like” and “dislike” counts, scores, number of favorites, references and other properties change intensively with every user click. If the system must update the database as soon as each associated click event is triggered, the workload is high. It is also completely acceptable to lose a few minutes of the above statistics because of a hardware failure. Thus, the changed data can be temporarily stored in the local cache of the owner node, and the database is updated in batches every few minutes.

This not only significantly reduces the number of requests executed by the backend database, but also eliminates a significant amount of disk flushing by encapsulating multiple video data update requests into a single batch transaction, resulting in further efficiency improvements. In addition, updating video properties through a dedicated owner node avoids the contention that arises when multiple nodes simultaneously update the same object in a stateless cluster. This further improves database performance.

Traditional Stateless Cluster: Cumulative write optimization and batch write optimization cannot be implemented, because each request may be forwarded to a different server node for processing. The write pressure on the backend database is very high. A plurality of nodes may compete to update the same record simultaneously, further increasing the burden on the database. Additional optimizations are required, such as horizontal partitioning and vertical partitioning. However, these optimizations also result in side effects such as “need to implement distributed transaction support at the application layer.”

Item 4 (Bal.)

PSS HPC: Clusters can be scheduled using a combination of active and passive load balancing.

Passive balancing: Each node in the cluster periodically unloads videos that are no longer active, and notifies the PSS service to bulk release the corresponding ports for those videos. This algorithm implements macro load balancing (in the long term, clusters are balanced).

Active balancing: The cluster selects the load balancing coordinator node through the PSS service. This node continuously monitors the load of each node in the cluster and sends instructions for load scheduling (e.g., requesting node A to transfer 10,000 videos owned by it to node B). Unlike passive balancing at the macro level, the active balancing mechanism can act within a shorter time slice and responds more quickly.

Active balancing is usually effective when some of the nodes in the cluster have just recovered from a failure (and are therefore in a no-load state); it reacts more rapidly than passive balancing. For example: in a cluster that spans multiple active IDCs, an IDC comes back online when a cable fault has just been repaired.

Traditional Stateless Cluster: When recovering from a fault, its balance is worse than that of a PSS actively balanced cluster. However, there is no significant difference under normal circumstances. In addition, an SOA architecture tends toward imbalance between multiple services, leaving some services overloaded and others lightly loaded; a μSOA cluster has no such shortcomings.
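The coordinator's role in active balancing can be illustrated with a simple greedy planner. The source does not specify the scheduling algorithm, so `plan_transfers` is purely an assumption: it moves objects from nodes above the average load to nodes below it and returns the resulting instructions. Coordinator election and the actual ownership hand-over through PSS are not modeled.

```python
def plan_transfers(load):
    """Greedy plan of (source, destination, count) moves toward the average load.

    `load` maps node ID -> number of owned objects (videos, users, ...).
    """
    load = dict(load)  # work on a copy; the caller's dict is left untouched
    target = sum(load.values()) // len(load)
    donors = sorted((n for n in load if load[n] > target), key=load.get, reverse=True)
    receivers = sorted((n for n in load if load[n] < target), key=load.get)
    moves = []
    for dst in receivers:
        for src in donors:
            count = min(target - load[dst], load[src] - target)
            if count > 0:
                moves.append((src, dst, count))
                load[src] -= count
                load[dst] += count
    return moves


# Node B has just recovered from a failure and owns nothing.
print(plan_transfers({"A": 20000, "B": 0, "C": 10000}))  # [('A', 'B', 10000)]
```

The output corresponds to the instruction used as an example in the table: node A is asked to transfer 10,000 videos to node B.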

As in the previously mentioned user management case, the precise collaboration algorithm described above does not result in any loss of service availability for the cluster. Consider the case where a node in the cluster goes down due to a failure: the PSS service detects that the node is offline and automatically releases all videos belonging to that node. When a user next accesses one of these video objects, the server node that receives the request takes ownership of the video object from PSS and completes the request. At this point, the new node replaces the failed offline node as the owner of this video object (see step 2-c-i above). This process is transparent to the user and does not require additional processing logic in the client.

The above analysis of “User Management” and “Video Services” is just an appetizer. In practical applications, the fine resource coordination capability provided by PSS through its high-performance, high-capacity features can be applied to the Internet, telecommunications, Internet of Things, big data processing, streaming computing and other fields.

To sum up, the port switch service is a message routing service integrating distributed coordination functions such as fault detection, service electing, service discovery, and distributed lock. By sacrificing reliability under the extreme condition, the port switch service disclosed by the present invention realizes very high performance, capacity and concurrency capability on the premise of ensuring strong consistency, high availability and scalability (horizontal scaling).

1. A port switch service (Port Switch Service, PSS), comprising a server cluster and a client cluster, wherein a master node in the current cluster is elected from the server cluster through a quorum algorithm and is guaranteed to be unique within a specified period in a lease form; the client cluster contains various client nodes needing to use the PSS, and each client node can establish connection with the master node as needed; each of the client nodes is identified in the server cluster through the unique node ID; and the server cluster employs a mode of one master node plus a plurality of slave nodes, or a mode of one master node plus a plurality of slave nodes plus a plurality of arbiter nodes, and all data is stored in the memory (RAM) of the master node only (full-in-memory).
2. The port switch service (PSS) according to claim 1, wherein each of the client nodes maintains at least one TCP Keep-Alive connection with the port switch service.
3. The port switch service (PSS) according to claim 2, wherein any number of ports can be registered for each connection; the name of the port is described using a UTF-8 character string and must be globally unique; and the port contains a message caching queue and a port release notification list.
4. The port switch service (PSS) according to claim 1, wherein the PSS offers the following application programming interface (API) primitives: Waiting for Message (WaitMsg), Relet, Port Registration (RegPort), Port Un-registration (UnRegPort), Message Sending (SendMsg), Port Query (QueryPort), Node Query (QueryNode) and Clear; the Port Registration primitive permits one communication request to contain multiple port registration commands simultaneously; the Port Un-registration primitive permits one communication request to contain multiple port un-registration commands simultaneously; and the Message Sending primitive permits one communication request to contain multiple messages simultaneously (batch message sending).
5. The port switch service (PSS) according to claim 4, wherein the connection of the client cluster and port switch service includes message receiving connections and message sending connections; the message receiving connection (1:1) uses the WaitMsg method for the node registration and message pushing, keeps occupying all ports belonging to the current node using Relet, and uses the Clear primitive to clean up before normal disconnection; each node within the cluster should keep one and only one message receiving connection, which is a Keep-Alive connection; the connection is always kept active and Relet is completed in a timely manner, because re-establishing a receiving connection will require service electing again (port registration); with respect to the message sending connection (1:N): all connections that are not upgraded using the WaitMsg API are deemed sending connections, which use primitives like RegPort, UnRegPort, SendMsg and QueryPort for non-pushing requests, without the need for using Relet to keep heartbeat, and do not need to use the Clear command to clean up; and each node within the cluster maintains a message sending connection pool, so that the worker threads can stay in communication with the port switch service.
6. The port switch service (PSS) according to claim 1, wherein the server cluster can be segmented into sub server clusters by name spaces, and the sub server clusters achieve horizontal scaling through a tree cascade structure; and each of the client nodes is registered on ports under a local name space and a superior name space of the corresponding client node.